In a large class of statistical inverse problems it is necessary to
suppose that the transformation that is inverted is known. Although, 
in many applications, it is unrealistic to make this assumption, the 
problem is often insoluble without it. However, if additional data are 
available, then it is possible to estimate consistently the unknown er- 
ror density. Data are seldom available directly on the transformation, 
but repeated, or replicated, measurements increasingly are becoming 
available. Such data consist of "intrinsic" values that are measured 
several times, with errors that are generally independent. Working 
in this setting we treat the nonparametric deconvolution problems of 
density estimation with observation errors, and regression with errors 
in variables. We show that, even if the number of repeated measure- 
ments is quite small, it is possible for modified kernel estimators to 
achieve the same level of performance they would if the error distri- 
bution were known. Indeed, density and regression estimators can be 
constructed from replicated data so that they have the same first- 
order properties as conventional estimators in the known-error case, 
without any replication, but with sample size equal to the sum of 
the numbers of replicates. Practical methods for constructing esti- 
mators with these properties are suggested, involving empirical rules 
for smoothing-parameter choice. 

1. Introduction. Statistical deconvolution problems arise in a great many 
settings, and typically have the form $g = T(f)$, where $g$ is a function about
which we have data, $T$ is a transformation, and $f = T^{-1}(g)$ is a function we
wish to estimate. In a large class of such problems, including density decon- 
volution and errors-in-variables regression, it is common to assume that T is 
known. Indeed, the nature of the data usually precludes any other approach. 
In this paper we consider cases where there is a small number of replications
of each intrinsically different observation, the observation errors being inde- 
pendent and the intrinsic parts of the observations being the same among 
replicates. Data of this type are numerous, and increasingly are becoming 
available in various fields. Examples include work of Jaech (1985), who de- 
scribes an experiment where the concentration of uranium is measured for 
several fuel pellets; of Biemer et al. (1991), who discuss repeated observa- 
tions in a social science context; of Andersen, Bro and Brockhoff (2003), on 
nuclear magnetic resonance; of Bland and Altman (1986), on lung function;
of Eliasziw et al. (1994), on physiotherapy for the knee; of Oman,
Meir and Haim (1999), relating to kidney function; and of Dunn (1989), a 
brain-related study. For further medical examples, see Carroll, Ruppert and 
Stefanski (1995) and Dunn (2004). 

When data of this type are available, it is usually possible to construct
consistent estimators of the function $f$ of interest, without making
parametric assumptions about the transformation $T$. We treat both density
deconvolution and errors-in-variables regression, focusing on cases where the
convergence rate, and first-order properties more generally, are the same
when the error distribution is known and when it is not known, but is
estimated from repeated measurements. In Section 2 we construct a relatively
simple density estimator and generalize it to the regression case.

Theoretical properties of our estimators are taken up in Section 3. We 
show that a sufficient condition for first-order properties of estimators, in 
the cases of known and unknown error distributions, to be equivalent, is that, 
colloquially speaking, "the target density is smoother than half a derivative 
of the error density." Instances where this condition is violated are those 
where the convergence rate is relatively poor, even when the error density 
is known. 

We direct attention to examples where the number of replications of each 
observation is relatively small. (We use the terms "replications" and "re- 
peated measurements" synonymously.) In theoretical terms, this means that 
the number of replications is uniformly bounded. That is generally the case 
in practice, since gathering large numbers of replications is expensive in 
terms of time, effort or money. Moreover, particularly in cases where statis- 
tical performance is the same when the error density is known or unknown, 
it is seldom advantageous to have large numbers of replications. 

For instance, we show that if the total number of data is $M = np$, where
$p \ge 2$ equals the number of times that each of $n$ intrinsically different
observations is replicated, then first-order properties of nonparametric
estimators depend only on $M$, not on the separate values of $n$ and $p$. We
prove this result rigorously when $p$ is bounded, but a similar argument shows
that it is also valid if $p$ diverges sufficiently slowly as $M$ increases.
More generally, the result holds if $M = \sum_j N_j$, where $N_j$ is the number
of replicates of the $j$th intrinsically different observation. Properties of
the estimator depend, to first order, only on $M$, provided that each $N_j \ge 2$.

In Section 4 we develop an adaptive, data-driven procedure for
smoothing-parameter choice, and show that it enjoys good performance for real
and simulated datasets.

Related work in the context of density estimation includes that of Li 
and Vuong (1998), who derived upper bounds to convergence rates in the 
measurement-error problem when replications are present. Li and Vuong's 
results are important; they comprise some of the first contributions to den- 
sity deconvolution in cases where the error distribution is not known. Never- 
theless, the properties reported by Li and Vuong (1998), and bounds given 
also by Susko and Nadon (2002), are too coarse to permit it to be shown 
that convergence rates can be identical in the cases of known and unknown 
error distributions. Further discussion is given in Section 3.5. 

Recent, related research in the regression setting, and in the econometrics 
literature, includes that of Li (2002), Li and Hsiao (2004) and Schennach 
(2004a, 2004b), who demonstrated that replications can be used to good 
effect in regression problems with measurement error. See also the work of 
Horowitz and Markatou (1996) on error estimation from panel data, and 
the extensive literature, accessible through the work of Newey and Powell 
(2003), on inference in the context of instrumental variables. However, except 
in parametric contexts, this and related work is not sufficiently detailed 
to show that the convergence rates familiar in problems where the error 
distribution is known can also be enjoyed when the distribution is accessible 
only via repeated measurements. 

The problem of density estimation with unknown error density, estimated 
from a sample of the error, has been considered by Diggle and Hall (1993), 
Barry and Diggle (1995) and Neumann (1997). Madansky (1959), Carroll, 
Eltinge and Ruppert (1993) and Huang and Yang (2000), among others, 
have discussed linear regression with replicated data, when at least some of 
the predictors are measured with error. Early work on the problem of density 
deconvolution, under the assumption of known distribution of measurement 
error, includes that of Carroll and Hall (1988), Stefanski and Carroll (1990) 
and Fan (1991). More recent contributions, including surveys of earlier re- 
search, include the papers of Delaigle and Gijbels (2002, 2004) and van 
Es and Uh (2005). The literature on kernel methods for errors-in-variables
regression is particularly large, and is surveyed by Carroll, Ruppert and 
Stefanski (1995). 

2. Models and methodology. 

2.1. Density deconvolution. Suppose we observe

(2.1)  $W_{jk} = X_j + U_{jk}$  for $1 \le k \le N_j$ and $1 \le j \le n$,

where the random variables $X_j$ are identically distributed as $X$, the
$U_{jk}$'s are identically distributed as $U$, and the $X_j$'s and $U_{jk}$'s are
totally independent. We wish to estimate the density of $X$. In the context of
our discussion in Section 1, (2.1) indicates that there are $n$ subsets of
"intrinsically different" data and, within the $j$th of these subsets, $N_j$
repeated, or replicated, measurements of the variable $X_j$.
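As an illustration of the data structure in (2.1), the following minimal sketch simulates replicated measurements. The function names, the choice of a standard normal law for $X$ and a Laplace law for $U$, and the replicate counts are all assumptions made for the example; they are not taken from the paper.

```python
import random

def laplace(rng, scale):
    """Draw from a centered Laplace law, as a scaled difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def simulate_replicates(n_reps, draw_x, draw_u, seed=0):
    """Return W with W[j][k] = X_j + U_jk, following model (2.1).

    n_reps[j] is N_j, the number of replicates of the j-th intrinsic value.
    """
    rng = random.Random(seed)
    data = []
    for reps in n_reps:
        x_j = draw_x(rng)  # intrinsic value shared by all replicates of unit j
        data.append([x_j + draw_u(rng) for _ in range(reps)])
    return data

# Hypothetical example: n = 5 units with between 2 and 4 replicates each.
W = simulate_replicates(
    n_reps=[2, 2, 3, 2, 4],
    draw_x=lambda rng: rng.gauss(0.0, 1.0),
    draw_u=lambda rng: laplace(rng, 0.3),
)
M = sum(len(row) for row in W)  # total number of observations, M = sum_j N_j
```

Storing the data as one list per intrinsic value keeps the replicate structure explicit, which is what the estimators below rely on.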

Let $f_U$ and $f_X$ denote the respective densities of $U$ and $X$, and write
$f_U^{\mathrm{Ft}}$ and $f_X^{\mathrm{Ft}}$ for the respective characteristic
functions (i.e., the Fourier transforms of those densities). Provided that

(2.2)  $f_U^{\mathrm{Ft}}$ is real-valued and does not vanish at any point on the real line,

a consistent estimator of $f_U^{\mathrm{Ft}}$ is given by

(2.3)  $\hat f_U^{\mathrm{Ft}}(t) = \Bigl| \frac{1}{N} \sum_{j=1}^{n} \sum_{(k_1,k_2) \in S_j} \cos\{t(W_{jk_1} - W_{jk_2})\} \Bigr|^{1/2}$,

where $S_j$ denotes the set of $\frac12 N_j(N_j - 1)$ distinct pairs $(k_1,k_2)$
with $1 \le k_1 < k_2 \le N_j$, $N = N(n) = \frac12 \sum_{j \le n} N_j(N_j - 1)$,
and we ignore values of $j$ for which $N_j = 1$. Assumption (2.2) is
conventional when using kernel methods for density deconvolution; see
Stefanski and Carroll (1990) and Fan (1991), for example.
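The estimator (2.3) works because $W_{jk_1} - W_{jk_2} = U_{jk_1} - U_{jk_2}$, so the expected value of each cosine term equals $|f_U^{\mathrm{Ft}}(t)|^2$, which is $f_U^{\mathrm{Ft}}(t)^2$ under (2.2). A minimal sketch of the computation follows; the function name and the list-of-lists data layout are assumptions made for illustration.

```python
import math
from itertools import combinations

def f_u_ft_hat(t, W):
    """Estimate the error characteristic function at t from replicates, as in (2.3).

    W is a list of lists; W[j] holds the N_j replicates of the j-th unit.
    Units with a single replicate contribute no pairs and are ignored.
    """
    total = 0.0
    n_pairs = 0
    for row in W:
        for w1, w2 in combinations(row, 2):  # all pairs k1 < k2 within unit j
            total += math.cos(t * (w1 - w2))
            n_pairs += 1
    # Absolute value and square root, since the pair average estimates f_U^Ft(t)^2.
    return math.sqrt(abs(total / n_pairs))
```

The sketch divides by the number of pairs, so it requires at least one unit with two or more replicates.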

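An estimator of $f_X$ is given by

$\hat f_X(x) = \frac{1}{Mh} \sum_{j=1}^{n} w_j \sum_{k=1}^{N_j} \hat L\Bigl(\frac{x - W_{jk}}{h}\Bigr)$,

where $M = \sum_j N_j$, the weights $w_j$ are nonnegative and satisfy
$\sum_j w_j N_j = M$,

(2.4)  $\hat L(u) = \frac{1}{2\pi} \int e^{-itu} \frac{K^{\mathrm{Ft}}(t)}{\hat f_U^{\mathrm{Ft}}(t/h) + \rho} \, dt$,

$K$ is a symmetric kernel function with compactly supported Fourier transform
$K^{\mathrm{Ft}}$, $h > 0$ is a bandwidth, and $\rho \ge 0$ is a ridge parameter.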

We introduce the ridge only so we can take expectation without concern
for fluctuations of the denominator in the integral at (2.4). The ridge would
not be necessary if our aim were to develop limit theory for $\hat f_X$ that did
not involve taking expected values. See Section 3.1 for discussion and theory in
the case $\rho = 0$.
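The ridge-regularized density estimator can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes unit weights, assumes the integrand at (2.4) has denominator $\hat f_U^{\mathrm{Ft}}(t/h) + \rho$, uses the kernel with $K^{\mathrm{Ft}}(t) = (1 - t^2)^3$ on $[-1, 1]$ (a common choice in deconvolution), and approximates the integral by a midpoint rule. All function names are hypothetical.

```python
import math
from itertools import combinations

def k_ft(t):
    """Assumed kernel Fourier transform: (1 - t^2)^3 on [-1, 1], zero elsewhere."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def f_u_ft_hat(t, W):
    """Estimator (2.3) of the error characteristic function."""
    pairs = [(a, b) for row in W for a, b in combinations(row, 2)]
    return math.sqrt(abs(sum(math.cos(t * (a - b)) for a, b in pairs) / len(pairs)))

def f_x_hat(x, W, h, rho=0.0, n_grid=200):
    """Deconvolution density estimate at x, with unit weights w_j = 1."""
    obs = [w for row in W for w in row]
    M = len(obs)
    dt = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * dt  # midpoint rule on (0, 1)
        ratio = k_ft(t) / (f_u_ft_hat(t / h, W) + rho)
        # The integrand is even in t, so integrate cos(.) over (0, 1) and double;
        # the factor 2 / (2 pi) becomes the 1 / pi below.
        total += ratio * sum(math.cos(t * (x - w) / h) for w in obs) * dt
    return total / (math.pi * M * h)
```

With $\rho = 0$ the sketch fails if the estimated characteristic function vanishes on the grid, which is the fluctuation problem the ridge is meant to control.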

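If $f_U$ were known then, instead of $\hat f_X$, we would use the following
generalization of the conventional deconvolution estimator:

$\tilde f_X(x) = \frac{1}{Mh} \sum_{j=1}^{n} w_j \sum_{k=1}^{N_j} L\Bigl(\frac{x - W_{jk}}{h}\Bigr)$

[see, e.g., Carroll and Hall (1988)], where

$L(u) = \frac{1}{2\pi} \int e^{-itu} \frac{K^{\mathrm{Ft}}(t)}{f_U^{\mathrm{Ft}}(t/h)} \, dt$.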



The bias of $\tilde f_X$ does not depend on choice of the weights, and it can
readily be shown that the asymptotic variance is minimized by taking each
$w_j = 1$. Optimality of this choice persists in the case of regression
deconvolution, which we consider in Section 2.2.

Therefore, we take each $w_j = 1$ in the work below. In particular, $\hat f_X$
and $\tilde f_X$ henceforth denote the estimators

$\hat f_X(x) = \frac{1}{Mh} \sum_{j=1}^{n} \sum_{k=1}^{N_j} \hat L\Bigl(\frac{x - W_{jk}}{h}\Bigr)$

and

$\tilde f_X(x) = \frac{1}{Mh} \sum_{j=1}^{n} \sum_{k=1}^{N_j} L\Bigl(\frac{x - W_{jk}}{h}\Bigr)$.

Section 3.3 demonstrates that $\hat f_X$ is first-order equivalent to
$\tilde f_X$. For this result, and in the setting of "ordinary-smooth errors"
[see (3.1)], the main assumption needed is that $f_X$ be sufficiently smooth
relative to $f_U$. See condition (3.12). Properties of $\tilde f_X$ are
summarized in Section 3.4.

2.2. Errors-in-variables regression. Here the model at (2.1) is extended,
so that it addresses data $(W_{jk}, Y_j)$ generated as

(2.5)  $W_{jk} = X_j + U_{jk}$,  $Y_j = g(X_j) + V_j$,  for $1 \le k \le N_j$ and $1 \le j \le n$,

where the $X_j$'s, $U_{jk}$'s and $V_j$'s are identically distributed as $X$, $U$
and $V$, respectively, $E(V) = 0$, $E(V^2) < \infty$, and the $X_j$'s, $U_{jk}$'s
and $V_j$'s are totally independent. We wish to estimate the function $g$.
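Define

(2.6)  $\hat a(x) = \frac{1}{Mh} \sum_{j=1}^{n} \sum_{k=1}^{N_j} Y_j \hat L\Bigl(\frac{x - W_{jk}}{h}\Bigr)$,  $\tilde a(x) = \frac{1}{Mh} \sum_{j=1}^{n} \sum_{k=1}^{N_j} Y_j L\Bigl(\frac{x - W_{jk}}{h}\Bigr)$.

In the classical case, where $f_U$ is known and each $N_j = 1$, the standard
kernel estimator of $g$ is $\tilde g = \tilde a / \tilde f_X$ and, of course,
$\tilde g$ is also appropriate in the case of replicated data.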
The intuition behind $\tilde g$ is that $\tilde a$ is a consistent estimator of
the function $a = f_X g$. When $f_U$ is not known we can estimate $a$ by
$\hat a$, and so we can modify $\tilde g$ in the manner of Section 2.1,
estimating $g$ by $\hat g = \hat a / \hat f_X$. We show in Section 3.6 that
$\hat g$ is first-order equivalent to $\tilde g$.
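The ratio form of the regression estimator can be sketched as below. The sketch assumes that the numerator weights each deconvolution-kernel term for unit $j$ by $Y_j$, that the ridge enters as in (2.4), and that $K^{\mathrm{Ft}}(t) = (1 - t^2)^3$ on $[-1, 1]$; the function names are hypothetical. Because numerator and denominator share the factor $1/(\pi M h)$, that factor is omitted.

```python
import math
from itertools import combinations

def k_ft(t):
    """Assumed kernel Fourier transform: (1 - t^2)^3 on [-1, 1], zero elsewhere."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def f_u_ft_hat(t, W):
    """Estimator (2.3) of the error characteristic function."""
    pairs = [(a, b) for row in W for a, b in combinations(row, 2)]
    return math.sqrt(abs(sum(math.cos(t * (a - b)) for a, b in pairs) / len(pairs)))

def g_hat(x, W, Y, h, rho=0.0, n_grid=200):
    """Errors-in-variables regression estimate at x, as a ratio a_hat / f_X_hat."""
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        t = (i + 0.5) / n_grid  # midpoint rule on (0, 1)
        ratio = k_ft(t) / (f_u_ft_hat(t / h, W) + rho)
        for row, y in zip(W, Y):
            c = sum(math.cos(t * (x - w) / h) for w in row)
            num += ratio * y * c  # contribution to the numerator a_hat(x)
            den += ratio * c      # contribution to the denominator f_X_hat(x)
    return num / den
```

A quick sanity check is that a constant response is reproduced exactly, since numerator and denominator then differ only by that constant. The ratio is undefined where the density estimate in the denominator is zero.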

3. Theoretical properties. 

3.1. Density deconvolution. First we state assumptions. We ask that, for
constants $\alpha > 0$ and $B_1 > 1$, and all real $t$,

(3.1)  $B_1^{-1} (1 + |t|)^{-\alpha} \le |f_U^{\mathrm{Ft}}(t)| \le B_1 (1 + |t|)^{-\alpha}$.

This is often referred to as the case of ordinary-smooth errors. The
importance of the lower bound in (3.1), in addition to the upper bound (which
is conventional when deriving convergence rates), is discussed in Section 3.3.
Given $\beta, B_2 > 0$, let $\mathcal{F}(\beta, B_2)$ denote the class of
densities $f_X$ for which

$\sup_{-\infty < t < \infty} (1 + |t|)^{\beta} |f_X^{\mathrm{Ft}}(t)| \le B_2$.

[The class $\mathcal{F}(\beta, B_2)$ is a Fourier analogue of Fan's class
$\mathcal{C}_{m,\alpha,B}$ of functions; his $m + \alpha + 1$ is our $\beta$.]
Let $K$ have the property

(3.2)  $\sup |K^{\mathrm{Ft}}| < \infty$ and, for some $c > 0$, $K^{\mathrm{Ft}}(t) = 0$ for all $|t| > c$.

The kernels used in deconvolution commonly have this property, and so,
while our results can be derived under weaker conditions, there is little
motivation for that generalization.

The theorem below gives an upper bound to pointwise mean-squared distance
between $\hat f_X$ and $\tilde f_X$, uniformly in all points and all densities
$f_X \in \mathcal{F}(\beta, C_2)$. In Section 3.3 we use that result to show
that, if the bandwidth $h$ is chosen so that it gives optimal performance of
$\tilde f_X$, and if a relation (3.12) on the relative smoothnesses of $f_U$ and
$f_X$ holds, then the difference between $\hat f_X$ and $\tilde f_X$ is
negligible relative to the distance between either estimator and the true
density, $f_X$.

Theorem 3.1. Let $C_1 > 1$ and $C_2, \beta > 0$. Assume that (i) $1 \le N_j \le C_1$
for each $j$; (ii) $N(n) \ge C_1^{-1} n$ for each $n \ge 1$; (iii) $f_U^{\mathrm{Ft}}$ satisfies (3.1); (iv)
$\alpha > 1/2$; (v) $K^{\mathrm{Ft}}$ satisfies (3.2); (vi) $h_1(n) \le h \le h_2(n)$, where $h_2(n) \to 0$ and,
for some 6 > 0, n^^~^^^^"hi{n) is bounded away from zero; and (vii) cin"'^^ < 
P < C3 min{/ii(n)^""''^, where ci, 02,03 > 0. Then, for each integer k > 
I, 

(3.3)  $\sup_{f_X \in \mathcal{F}(\beta, C_2)} \; \sup_{-\infty < x < \infty} E\{\hat f_X(x) - \tilde f_X(x)\}^2 \le \text{const.} \, p_n$,
where, for each integer $k \ge 1$,
(3.4)

and the constant in (3.3) depends on $k$ but not on $h \in [h_1(n), h_2(n)]$ or on
$n$.

Technical arguments are given in a longer version of this paper [Delaigle,
Hall and Meister (2006)]. Theorem 3.1 remains correct without condition
(i), that is, without the assumption that the $N_j$'s are bounded uniformly
in $j$ and $n$. However, if (i) is dropped, then the asymptotic properties of
$\hat f_X$ cannot be discussed simply in terms of the size of $M$, and that
difficulty hampers elucidation of our results. Indeed, if condition (i) is
removed, then, depending on the size of the $N_j$'s, and on the frequency with
which large $N_j$'s occur, properties of $\hat f_X$ can be very close to those
of a standard kernel estimator based on the (unobservable) data $X_j$. In
practice, the expense, in terms of time, effort or money, of making repeated
measurements usually ensures that the $N_j$'s are relatively small, typically
no more than 2 to 5, and so we shall retain condition (i).

We argued in Section 2 that, if we were to develop a limit theory that did
not involve taking expected values, the ridge parameter $\rho$ could be taken
equal to zero. In that setting we should replace uniform pointwise error, at
(3.3), by error at a single point, or by a global metric such as integrated
squared error. Otherwise, we incur a logarithmic penalty on the right-hand
side of (3.3). [This is to be expected, since the same penalty arises in more
conventional problems; see, e.g., Bickel and Rosenblatt (1973).] We should
also remove the supremum over densities $f_X \in \mathcal{F}(\beta, C_2)$, since
the uniformity implied by the supremum is not meaningful if we remove the
expectation.

For the sake of definiteness, when working with $\rho = 0$, we measure
accuracy in terms of squared error at a particular point, or integrated squared
error. To treat the latter, note that (3.3) implies that, for each pair
$x_1, x_2$ for which $-\infty < x_1 < x_2 < \infty$,

(3.5)  $\sup_{f_X \in \mathcal{F}(\beta, C_2)} \int_{x_1}^{x_2} E\{\hat f_X(x) - \tilde f_X(x)\}^2 \, dx = O(p_n)$.

Let $\hat f_X^{0}$ denote the version of $\hat f_X$ constructed with $\rho = 0$.
We claim that (3.5) continues to apply to $\hat f_X^{0}$, provided the
expectation and supremum over $f_X$ are removed from the left-hand side, and the
right-hand side is interpreted in an "in probability" sense. Moreover, squared
error at each fixed point $x$ converges at the same rate:

(3.6)  $|\hat f_X^{0}(x) - \tilde f_X(x)| = O_p(p_n^{1/2})$,  $\int_{x_1}^{x_2} \{\hat f_X^{0}(x) - \tilde f_X(x)\}^2 \, dx = O_p(p_n)$.

Theorem 3.2. Let $C_1 > 1$, let $C_2, \alpha, \beta > 0$, let
$-\infty < x_1 < x_2 < \infty$, and take $\rho = 0$ in the definition of
$\hat L$, at (2.4), and hence, also in the definition of $\hat f_X$, obtaining
the estimator $\hat f_X^{0}$. Assume that conditions (i)-(vi) in Theorem 3.1
hold. Then (3.6) holds for each $f_X \in \mathcal{F}(\beta, C_2)$, each
$x \in (-\infty, \infty)$ and each pair $x_1, x_2$ for which
$-\infty < x_1 < x_2 < \infty$.

Results (3.6) and (3.10), below, show that optimal convergence rates can 
be achieved using a single smoothing parameter, the bandwidth, rather than 
two parameters, the bandwidth and ridge. 

3.2. Asymptotic optimality. The size of bandwidth that minimizes pointwise
mean squared error, when using $\tilde f_X$ to estimate $f_X$, is
$h \asymp h_0 = n^{-1/\{2(\alpha+\beta)-1\}}$ and, for such a bandwidth,
pointwise mean squared error of $\tilde f_X$ is of size $q_n$, where

(3.7)  $q_n = n^{-2(\beta-1)/\{2(\alpha+\beta)-1\}}$.

The same result holds if we replace $\tilde f_X$ by the errors-in-variables
regression estimator, $\tilde g$, which we define in Section 3.6. See Fan (1991)
and Fan and Truong (1993) for discussion of theory in these respective cases,
and also for proofs of lower bounds which show that the rate $q_n$ is minimax
optimal, in an $L_2$ sense.
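To make the rate concrete, the following small helper evaluates the orders of the optimal bandwidth and of the mean squared error, up to constants, for given smoothness indices. It assumes the exponents $-1/\{2(\alpha+\beta)-1\}$ and $-2(\beta-1)/\{2(\alpha+\beta)-1\}$ as displayed in this section; the function name is hypothetical.

```python
def optimal_rates(n, alpha, beta):
    """Return (h0, qn): orders of the optimal bandwidth and of the pointwise MSE.

    alpha: decay exponent of the error characteristic function, as in (3.1).
    beta:  decay exponent of the target characteristic function.
    Both values are orders of magnitude only; constants are ignored.
    """
    denom = 2.0 * (alpha + beta) - 1.0
    h0 = n ** (-1.0 / denom)
    qn = n ** (-2.0 * (beta - 1.0) / denom)
    return h0, qn

# Example: Laplace-type errors (alpha = 2) and a target with beta = 3.
h0, qn = optimal_rates(1000, alpha=2.0, beta=3.0)
```

For $n = 1000$, $\alpha = 2$ and $\beta = 3$ the denominator is 9, so the bandwidth order is $1000^{-1/9}$ and the error order is $1000^{-4/9}$, which shows how slowly deconvolution rates improve with sample size.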

However, these results address only the case where there is no replication, 
that is, each Nj = 1. In the case of upper bounds, generalization to settings 
where each $N_j \ge 2$ is relatively straightforward. See Section 3.4 for details.
Below we generalize lower bounds in the setting of density deconvolution. 

Theorem 3.3. Assume that $\alpha, \beta > 1/2$. Let $\mathcal{F}(\beta, C)$
denote the class of densities $f_X$ defined in Section 3.1, and write
$\mathcal{J}$ for the class of all measurable functionals of the data. Assume
that $2 \le N_j \le B$ for each $j$, where $2 \le B < \infty$. Then, for each
fixed $x$ and each sufficiently large $C > 0$, there exists $D > 0$ such that,
for all sufficiently large $n$,

(3.8)  $\inf_{\hat f \in \mathcal{J}} \; \sup_{f_X \in \mathcal{F}(\beta, C)} E_{f_X}\{\hat f(x) - f_X(x)\}^2 \ge D q_n$.

3.3. Equivalence of $\hat f_X$ and $\tilde f_X$. In view of the results given in
Section 3.2, and in order to establish that $\hat f_X$ is asymptotically
equivalent to $\tilde f_X$ when the latter is performing optimally, it is
instructive to show that, when $h \asymp h_0$,

(3.9)  $\sup_{f_X \in \mathcal{F}(\beta, C_2)} \; \sup_{-\infty < x < \infty} E\{\hat f_X(x) - \tilde f_X(x)\}^2 = o(q_n)$,

if the ridge parameter $\rho$ is taken to be nonzero; or, if the ridge is zero,
that

(3.10)  $|\hat f_X^{0}(x) - \tilde f_X(x)| = o_p(q_n^{1/2})$,  $\int_{x_1}^{x_2} \{\hat f_X^{0}(x) - \tilde f_X(x)\}^2 \, dx = o_p(q_n)$.

Compare with (3.6). In fact, (3.9) and (3.10) follow from Theorems 3.1 and
3.2, respectively, if we prove that

(3.11)  $p_n = o(q_n)$.

Provided

(3.12)  $\beta > \alpha + \frac12$,

it is straightforward to show that if $h \asymp h_0$, then

(3.13)

and also that if $k$ is sufficiently large and $h \asymp h_0$, then
$n^{-k} h^{-4(k+2)\alpha - 2} = o(q_n)$. This result and (3.13) imply (3.11).

Therefore, condition (3.12), which can be characterized colloquially as
the assertion that "$f_X$ is smoother than half a derivative of $f_U$," is
sufficient to ensure that, in deconvolution problems, there is no first-order
loss of performance in using replicated data to estimate the error density when
the latter is not known. Intuition behind (3.12) is given in Section 3.5.

Of course, (3.12) fails if $\alpha$ is too large; that is, if $f_U$ is too
smooth. This is the reason for placing the lower bound on
$|f_U^{\mathrm{Ft}}(t)|$ in (3.1). Without that bound, $f_U$ can be arbitrarily
smooth. It can be shown that if $\beta < \alpha$, then $\hat f_X$ and
$\tilde f_X$ are not asymptotically equivalent, and the minimax-optimal,
pointwise convergence rate of an estimator of $f_X$ can be no faster than
$n^{-2(\beta-1)/(4\alpha-1)}$, which is strictly slower than the rate of
convergence of $\tilde f_X$ to $f_X$. However, the case where
$\alpha \le \beta \le \alpha + \frac12$ is still unclear.
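The three smoothness regimes described in this section can be summarized by a small classifier. The function name and the returned labels are hypothetical, and the boundary cases are assigned to the "unclear" regime on the assumption that the equivalence condition is the strict inequality (3.12) and the non-equivalence result needs $\beta < \alpha$ strictly.

```python
def equivalence_regime(alpha, beta):
    """Classify (alpha, beta) by the known-error / unknown-error comparison.

    alpha: smoothness index of the error density, as in (3.1).
    beta:  smoothness index of the target density.
    """
    if beta > alpha + 0.5:
        return "equivalent"       # condition (3.12) holds: no first-order loss
    if beta < alpha:
        return "not equivalent"   # estimating the error density costs in rate
    return "unclear"              # alpha <= beta <= alpha + 1/2

# Example: Laplace errors have alpha = 2, so beta must exceed 2.5 for (3.12).
regimes = [equivalence_regime(2.0, b) for b in (1.5, 2.2, 3.0)]
```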



3.4. Properties of $\tilde f_X$. Let $\bar f_X$ denote the "standard" version of
$\tilde f_X$, obtained by taking $N_j = 1$ for each $j$, but with sample size $M$
rather than $n$. Theorem 3.4, which is given below and is straightforward to
derive, argues that the bias of $\tilde f_X$ is identical to that of
$\bar f_X$, and that the variance of $\tilde f_X$ equals that of $\bar f_X$, to
first order.

Recall that $U$ and $X$ have the distributions of $U_{jk}$ and $X_j$,
respectively, that $W = X + U$, and that
$N = \frac12 \sum_{j \le n} N_j(N_j - 1)$. Put

$m_n(x) = \int K(u) f_X(x - hu) \, du$,

$v_n(x) = \frac{1}{Mh} \Bigl\{ \int L(u)^2 f_W(x - hu) \, du - h \, m_n(x)^2 \Bigr\}$,

$w_n(x) = \frac{2N}{M^2 h} \Bigl\{ \int K(u)^2 f_X(x - hu) \, du - h \, m_n(x)^2 \Bigr\}$.

Theorem 3.4. The mean and variance of $\bar f_X(x)$ equal $m_n(x)$ and
$v_n(x)$, respectively; the mean of $\tilde f_X(x)$ equals $m_n(x)$; and the
variance of $\tilde f_X(x)$ equals $v_n(x) + w_n(x)$.

The quantity $w_n$ is generally of strictly smaller order than $v_n$, since
$\int K^2$ remains fixed but $\int L^2$ diverges as $h$ decreases. Therefore, in
terms of first-order properties of mean and variance, $\tilde f_X$ and
$\bar f_X$ have identical performance. In view of this property, and bearing in
mind the asymptotic equivalence of $\hat f_X$ and $\tilde f_X$ noted in Section
3.3, we can fairly say that:

(3.14)  to first order, $\hat f_X$ has the same properties as a conventional
deconvolution density estimator, computed when the error density
is known and the sample size is $M$ but without any replication.

Of course, this assertion requires (3.9) and, hence, needs (3.12).

Together, (3.8), (3.9) and (3.14) demonstrate minimax optimality of the
estimator $\hat f_X$. Of course, this property necessitates the supremum being
taken over $f_X$ in (3.9). That requirement motivated our introduction of the
ridge parameter in our definition of $\hat f_X$.

3.5. Discussion of different approaches to density deconvolution. Let (2.2)'
denote the version of (2.2) where the assumption that $f_U^{\mathrm{Ft}}$ is
real-valued is omitted. For cases where (2.2)' holds but (2.2) fails, Li and
Vuong (1998) suggest an estimator of $f_U^{\mathrm{Ft}}$ quite different from
our $\hat f_U^{\mathrm{Ft}}$. However, from a practical viewpoint, the condition
that $f_U^{\mathrm{Ft}}$ be real-valued is mild. In particular, in the
nonparametric literature on density deconvolution and errors-in-variables
regression where $f_U$ is assumed known, that quantity is invariably taken to be
symmetric, in which case $f_U^{\mathrm{Ft}}$ is real-valued.

The alternative estimator suggested by Li and Vuong (1998) in the context 
of (2.2)' requires the distributions of both U and X to have characteristic 
functions that do not vanish anywhere (see Li and Vuong's condition A3) 
and also to be compactly supported (see their assumption A4). We are not 
aware of a distribution which enjoys both these properties. Certainly, none 
of the standard, compactly-supported distributions satisfy A3. This, and 
the numerical complexity of Li and Vuong's estimator, discouraged us from 
considering their technique. 
If $\alpha$ is sufficiently less than $\beta$, then the problem of estimating
$f_U$ from the differences $W_{jk_1} - W_{jk_2}$ is more difficult
statistically, although more straightforward numerically, than the problem of
estimating $f_U$ from the raw data $W_{jk}$. This indicates why condition (3.12)
is required. For values of $\alpha$ that are large relative to $\beta$,
alternative deconvolution methods may possibly give better theoretical
performance, although we are not aware of any that are attractive
computationally.

3.6. Errors-in-variables regression. The results in this section are closely
analogous to those in earlier sections, so we give only an outline. Recall from
Section 2.2 that, under the model (2.5), our estimator of $g$ is
$\hat g = \hat a / \hat f_X$, where $\hat a$ is an estimator, defined at (2.6),
of $a = f_X g$. Properties of $\hat g$ follow directly from those of the
numerator and denominator in the ratio $\hat a / \hat f_X$. The denominator is
treated in Theorems 3.1 and 3.2; here we address the numerator.

Given $f_X \in \mathcal{F}(\beta, C_2)$, let $\mathcal{G}(\beta, C_2 \mid f_X)$
denote the class of functions $g$ for which

$\sup_{-\infty < t < \infty} (1 + |t|)^{\beta} \Bigl| \int e^{itx} f_X(x) g(x) \, dx \Bigr| \le C_2$.

Recall that conditions associated with the errors-in-variables model (2.5)
include the assumption that $E(V) = 0$ and $E(V^2) < \infty$.

Theorem 3.5. Let $C_1 > 1$ and $C_2, \alpha, \beta > 0$. Assume (i)-(vii) in
Theorem 3.1. Then, for each integer $k \ge 1$,

(3.15)  $\sup_{f_X \in \mathcal{F}(\beta, C_2), \, g \in \mathcal{G}(\beta, C_2 \mid f_X)} \; \sup_{-\infty < x < \infty} E\{\hat a(x) - \tilde a(x)\}^2 \le \text{const.} \, p_n$,

where $p_n$ is as at (3.4) and the constant in (3.15) depends on $k$ but not on
$h \in [h_1(n), h_2(n)]$ or on $n$.



We know from Section 3.3 that, if $\alpha$ and $\beta$ satisfy (3.12), and if
$h$ is of the same size as the bandwidth that minimizes mean squared error of
$\tilde f_X$ (this is also the size of the optimal bandwidth for $\tilde a$ and
$\tilde g$), then $p_n = o(q_n)$. [Recall that $q_n$ is given by (3.7), and that
$q_n^{1/2}$ equals the minimum order of magnitude of error for estimators of
$f_X$, $a$, and $g$.] It then follows from Theorems 3.1 and 3.5, and (3.11),
that if conditions (i)-(vii) hold,
$\hat f_X(x) - \tilde f_X(x) = o_p(q_n^{1/2})$ and
$\hat a(x) - \tilde a(x) = o_p(q_n^{1/2})$. Therefore, provided $f_X(x) > 0$, we
have

(3.16)  $\hat g(x) = \frac{\hat a(x)}{\hat f_X(x)} = \frac{\tilde a(x)}{\tilde f_X(x)} + o_p(q_n^{1/2}) = \tilde g(x) + o_p(q_n^{1/2})$.

That is, if the bandwidth is chosen so that it is optimal for estimating $g$ by
$\tilde g$, then $\hat g$ is first-order equivalent to $\tilde g$.

It is straightforward to state and prove the analogue of Theorem 3.4 for
the estimator $\tilde a$ instead of $\tilde f_X$. This leads directly to the
analogue of (3.14), where the only change necessary is to replace $\hat f_X$ by
$\hat g$ and alter "density estimator" to "regression estimator."

An argument similar to that used in Section 5 to derive Theorem 3.2 can
be employed to show that (3.16) holds even if the ridge parameter, $\rho$, is
taken as zero. Therefore, (3.14) applies in the ridge-free case.

3.7. Supersmooth error case. All our discussion in the previous paragraphs
was based on the assumption that the error distribution is ordinary smooth,
and, in particular, satisfies (3.1). It is also of interest to treat the case
of supersmooth errors, so named because there the error density is infinitely
differentiable. In that context the following condition is imposed in place of
(3.1): for constants $\alpha > 0$, $\gamma > 0$ and $B_1 > 1$, and all real $t$,

(3.17)  $B_1^{-1} \exp(-\gamma |t|^{\alpha}) \le |f_U^{\mathrm{Ft}}(t)| \le B_1 \exp(-\gamma |t|^{\alpha})$.

For such error distributions, pointwise mean squared error, when employing
$\tilde f_X$ to estimate $f_X$, is of optimal order when using a bandwidth
$h = D(\log n)^{-1/\alpha}$, where $D > (4\gamma)^{1/\alpha}$ denotes a constant.
In this case, pointwise mean squared error of $\tilde f_X$ is of size
$q_n = (\log n)^{-2(\beta-1)/\alpha}$. Here, the rate of convergence of the
estimator $\tilde f_X$ is so slow that the loss of performance incurred by
estimating $f_U$ from the data, and using $\hat f_X$ instead of $\tilde f_X$, is
negligible, regardless of restrictions such as (3.12). In particular, the
following theorem holds. Its proof follows the lines of that of Theorem 3.1,
but is more straightforward.

Theorem 3.6. Let $C_1 > 1$ and $C_2, C_3, \alpha, \beta, \gamma > 0$. Assume that (i) $1 \le N_j \le C_1$
for each $j$; (ii) $N(n) \ge C_1^{-1} n$ for each $n \ge 1$; (iii) $f_U^{\mathrm{Ft}}$ satisfies
(3.17); (iv) $K^{\mathrm{Ft}}$ satisfies (3.2) with $c = 1$; (v) $h = D(\log n)^{-1/\alpha}$, with $D >$
(47)^/"; and (vi) p = C2n~'^, with k,> j. Then, for some e > 0, 

$\sup_{f_X \in \mathcal{F}(\beta, C_3)} \; \sup_{-\infty < x < \infty} E\{\hat f_X(x) - \tilde f_X(x)\}^2 \le \text{const.} \, n^{-\varepsilon}$.

This result is readily generalized to the estimator $\hat g$, provided $h$ is
chosen so that the optimal convergence rate for $\tilde g$ as an estimator of
$g$ is attained. In particular, if $h = D(\log n)^{-1/\alpha}$ where
$D > (4\gamma)^{1/\alpha}$, then $\hat g$ is first-order equivalent to
$\tilde g$.
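For supersmooth errors the bandwidth shrinks only logarithmically in $n$. The helper below evaluates $h = D(\log n)^{-1/\alpha}$, assuming the constant must satisfy $D > (4\gamma)^{1/\alpha}$ as stated in this section. The function name and the `factor` argument, which sets $D$ as a multiple of that lower bound, are hypothetical.

```python
import math

def supersmooth_bandwidth(n, alpha, gamma, factor=1.5):
    """Bandwidth h = D (log n)^(-1/alpha) with D = factor * (4 gamma)^(1/alpha).

    factor must exceed 1 so that D lies strictly above the lower bound.
    """
    if factor <= 1.0:
        raise ValueError("factor must be greater than 1")
    d = factor * (4.0 * gamma) ** (1.0 / alpha)
    return d * math.log(n) ** (-1.0 / alpha)

# Example: standard normal errors have characteristic function exp(-t^2 / 2),
# that is, alpha = 2 and gamma = 0.5.
h = supersmooth_bandwidth(math.exp(4.0), alpha=2.0, gamma=0.5, factor=2.0)
```

In the example, $\log n = 4$ and $(4\gamma)^{1/\alpha} = \sqrt 2$, so $h = 2\sqrt 2 \cdot 4^{-1/2} = \sqrt 2$.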

4. Numerical properties. 

4.1. Simulated examples. We study numerical properties of the estimators
$\hat f_X$ and $\hat g$ in several simulated examples. In the density case, and
following model (2.1), we generate 500 random samples of replicated
observations for $n$ individuals, $W_{ij}$, where $i = 1, \ldots, n$ and
$j = 1, \ldots, N_i$. We take the noise-to-signal ratio
$\sigma_U^2 / \sigma_X^2$ equal to 25%, except in the case of density (iii)
below, where we take $\sigma_U^2 / \sigma_X^2 = 10\%$. The notation
$\sigma_T^2$ denotes the variance of a random variable $T$. The error density
$f_U$ is chosen to be a Laplace or a centered normal density. In each instance
where the first of these choices is used, (3.12) is satisfied; the second
choice corresponds to a supersmooth density, and there (3.12) is not relevant.

We consider four target densities $f_X$: (i) $X \sim 0.5 N(-3, 1) + 0.5 N(2, 1)$,
(ii) $X \sim \chi^2(3)$,
(iii) $X \sim \sum_{\ell=0}^{5} (2^{5-\ell}/63) \, N\{(65 - 96 \times 2^{-\ell})/21, \, (32/63)^2 / 2^{2\ell}\}$
and (iv) $X \sim N(0, 1)$. Density (i) is bimodal and symmetric, density (ii)
is asymmetric and density (iii) is the smooth comb density discussed by
Marron and Wand (1992). Note that, even in the error-free case, the latter
density is particularly hard to estimate because of its numerous features.
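The noise-to-signal ratio fixes the error variance once the variance of $X$ is known. The sketch below works this out for density (i) with a 25% ratio. It uses the standard facts that a normal mixture has second moment $\sum_i p_i(\sigma_i^2 + \mu_i^2)$ and that a Laplace law with scale $b$ has variance $2b^2$; the function names are hypothetical.

```python
import math

def mixture_variance(weights, means, variances):
    """Variance of a finite normal mixture with the given components."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return second - mean * mean

def laplace_scale(error_variance):
    """Scale b of a centered Laplace law with the given variance 2 b^2."""
    return math.sqrt(error_variance / 2.0)

var_x = mixture_variance([0.5, 0.5], [-3.0, 2.0], [1.0, 1.0])  # density (i)
var_u = 0.25 * var_x            # 25% noise-to-signal ratio
b = laplace_scale(var_u)        # scale for Laplace errors
sd_normal = math.sqrt(var_u)    # standard deviation for centered normal errors
```

For density (i) the mixture mean is $-0.5$ and the variance is $7.25$, so the error variance is $1.8125$ under either error law.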

In the regression case we generate 500 datasets of randomly-sampled vectors
$(W_{ij}, Y_i)$, $i = 1, \ldots, n$, $j = 1, \ldots, N_i$, according to the
model (2.5). The density $f_X$ is chosen to be a uniform $U[0, 1]$ or a normal
$N(0.5, \sigma_X^2)$ density, with $\sigma_X^2$ chosen so that 0 and 1 are
respectively the 0.025 and 0.975 quantiles of $f_X$. The error density $f_U$ is
a Laplace or centered normal density, and the noise-to-signal ratio
$\sigma_U^2 / \sigma_X^2$ equals 10%. Except for our Bernoulli regression
example [see case (iii) below], the error density $f_V$ is a centered normal
density such that the noise-to-signal ratio $\sigma_V^2 / \sigma^2(g)$ equals
10%, where $\sigma^2(g)$ denotes the mean squared deviation of $g$ from its
average value.

We consider three regression curves: (i) g{x) = ^^(l — x)^, (ii) g{x) = 
$3x + 20(2\pi)^{-1/2} \exp\{-100(x - \frac12)^2\}$, (iii) $Y \mid X = x \sim \mathrm{Bernoulli}\{g(x)\}$, with
$g(x) = 0.45 \sin(2\pi x) + 0.5$. Note that curve (i) is unimodal and symmetric
around 0.5, curve (ii) is a mixture of a straight line and an exponential curve,
and curve (iii) is an asymmetric sinusoid.

We sought an automatic way of choosing the bandwidth, $h$. In the density
case, we suggest using $h_{\mathrm{PI}}$, the plug-in bandwidth of Delaigle and
Gijbels (2002, 2004), where the characteristic function of the error is
replaced by (2.3). This procedure is justified by the discussion in Section
3.3. In the regression case, a bandwidth-choice procedure could also be based
on a data-driven selector for the known error case. However, since, to our
knowledge, there does not exist such a method, we must first propose one.

A cross-validation (CV) criterion for selecting h would choose 

where, for j = 1, . . . , n, 

1=1 ^ ^' j=ii=i \ / 
Since the observations are not available, we need to replace all quantities
of the form 

^{ ^'~h^'' ) =hl ^M-itX,/h)eM^tW,e/h)^^^^dt, 

by empirical estimators. We suggest replacing $\exp(-itX_k/h)$ by an estimator
of its expected value, $f_X^{\mathrm{Ft}}(-t/h)$, based on the replications of
the $k$th intrinsic observation. Such an estimator can be defined by
$\hat f_W^{\mathrm{Ft}}(-t/h) / f_U^{\mathrm{Ft}}(-t/h)$, where
$\hat f_W^{\mathrm{Ft}}(t) = K^{\mathrm{Ft}}(ht) \, N_k^{-1} \sum_{m=1}^{N_k} \exp(itW_{km})$
is a kernel estimator of $f_W^{\mathrm{Ft}}$. Proceeding that way, our CV
criterion becomes

,,,, r . " /n^E2=ii^^g™ f 

(4.1) /icv = argmm> -- , 

h fc=iV i-Sk{Xk) ) 

where 

,4.2) s7x.,--th^(^^)lttY.^i^^^\ 

m=l £=l J=lm=l£=l 

with $L_2(x) = (2\pi)^{-1} \int \exp(-itx/h) |K^{\mathrm{Ft}}(t)|^2 |f_U^{\mathrm{Ft}}(t/h)|^{-2} \, dt$.

In the case of unknown error density, we define $h_{\mathrm{CV}}$ as in (4.1)
but we replace $L_2$ in (4.2) by

$\hat L_2(x) = (2\pi)^{-1} \int \exp(-itx/h) |K^{\mathrm{Ft}}(t)|^2 |\hat f_U^{\mathrm{Ft}}(t/h)|^{-2} \, dt$,

with $\hat f_U^{\mathrm{Ft}}(t)$ as in (2.3). As in the error-free case, the
computations needed to calculate this bandwidth can be reduced considerably by
binning the data. See, for example, Fan and Gijbels (1996), page 96. We suggest
placing the $W_{ij}$'s into 200 equispaced bins between their empirical 0.025
and 0.975 quantiles.
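The binning step can be sketched as follows. The text does not say how observations outside the two quantiles are handled; this sketch assigns them to the nearest end bin, which is an assumption, as are the function name and the use of bin centers as representative values.

```python
import statistics

def bin_observations(values, n_bins=200):
    """Bin values into equispaced bins between their 0.025 and 0.975 quantiles.

    Returns (centers, counts). Values outside the quantile range are counted
    in the first or last bin (an assumption of this sketch).
    """
    # With n=40 the cut points are the 0.025, 0.050, ..., 0.975 quantiles.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    lo, hi = cuts[0], cuts[-1]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts

# Example with the values 0, 1, ..., 1000: the quantiles are 25 and 975.
centers, counts = bin_observations([float(v) for v in range(1001)])
```

Sums over observations in the bandwidth criterion can then be replaced by count-weighted sums over the 200 bin centers, which is where the computational saving comes from.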

The selection of a ridge parameter can be avoided if, instead of using $\hat f_U^{ft}(t) + \rho$ in $\hat L$, we employ $\tilde f_U^{ft}(t) = \hat f_U^{ft}(t)\,I(t \in A) + f_P^{ft}(t)\,I(t \notin A)$, where $A$ denotes the largest interval around 0 in which $\hat f_U^{ft}(t)$ does not increase as $t$ moves away from 0, either to the left or to the right (excepting fluctuations very close to 0), and $f_P^{ft}(t)$ is a parametric function estimated from the observations and defined by $f_P^{ft}(t) = (1 + A_U t^2)^{-B_U}$, with $A_U$ and $B_U$ chosen so as to match the empirical second and fourth moments of the error with those of $f_P$. In the event that these moments are negative, we set $B_U = 1$ and take $A_U$ equal to half the empirical variance of the error, which corresponds to $f_P$ being a Laplace density. This method gives very good results in practice, sometimes even better than in the case of known error density. It is designed specifically for the comparatively small samples that typically arise in errors-in-variables regression with repeated measurements. The small sample sizes there are typically a consequence of the relatively high cost, in terms of time, effort or money, of making several observations of the same $X$, compared with making the same number of observations of different $X$'s.
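
One way the moment matching could be carried out is sketched below. The formulas are our own working and are not stated in the text: expanding $(1 + A_U t^2)^{-B_U}$ around $t = 0$ gives second moment $2A_UB_U$ and fourth moment $12B_U(B_U+1)A_U^2$, and for two replicates with independent, zero-mean errors the difference $D = W_{j1} - W_{j2}$ satisfies $E(D^2) = 2\mu_2$ and $E(D^4) = 2\mu_4 + 6\mu_2^2$. The function names are hypothetical, and the fallback is triggered here whenever the matched parameters are not positive and finite, which is an assumption about how the rule is applied.

```python
import numpy as np

def error_moments_from_pairs(w1, w2):
    """Estimate the second and fourth error moments from pairs of replicates."""
    d = np.asarray(w1, dtype=float) - np.asarray(w2, dtype=float)
    m2 = np.mean(d**2) / 2.0                      # E(D^2) = 2 * mu_2
    m4 = (np.mean(d**4) - 6.0 * m2**2) / 2.0      # E(D^4) = 2 * mu_4 + 6 * mu_2^2
    return m2, m4

def fit_parametric_cf(m2, m4):
    """Return (A, B) such that (1 + A t^2)^(-B) matches the moments m2 and m4."""
    excess = m4 / m2**2 - 3.0                     # equals 3 / B under the model
    if m4 <= 0.0 or excess <= 0.0:
        # Fallback described in the text: Laplace form, A = half the variance.
        return m2 / 2.0, 1.0
    B = 3.0 / excess
    A = m2 / (2.0 * B)
    return A, B

def parametric_cf(t, A, B):
    """Evaluate the parametric characteristic function (1 + A t^2)^(-B)."""
    return (1.0 + A * np.asarray(t, dtype=float) ** 2) ** (-B)
```

For a Laplace error with unit scale, the exact moments are 2 and 24, and the matching recovers $A_U = B_U = 1$.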

In our simulations we consider samples of sizes n = 50, 100 and 250, and fix the number of replications, $N_j$, at 2 or 4. In each case we generate 500 datasets, for each of which we calculate an estimate of the target curve by using the bandwidth $h_{PI}$ (density case) or the bandwidth $h_{CV}$ (regression case). We take $K^{ft}(t) = (1 - t^2)^3 I(t \in [-1,1])$; this kernel is commonly used in deconvolution problems. To evaluate performance, we calculate the integrated squared error (ISE) of each estimate, where $\mathrm{ISE} = \int_I (\hat m - m)^2$, with $m = f_X$ or $m = g$, and where $I$ is the whole real line (density case) or $I = [0,1]$ (regression case). In the graphs we present the three estimates that resulted in the first, second and third quartiles of the 500 calculated ISEs, and we denote them by, respectively, $q_1$, $q_2$ and $q_3$. We report only part of the simulations, although our conclusions are similar for the other, nonreported results.
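
The evaluation protocol can be sketched as follows, assuming each estimate has been evaluated on a common grid. The names `ise` and `quartile_indices` are hypothetical. In the density case the integral over the real line is necessarily truncated to the grid, which is an approximation.

```python
import numpy as np

def ise(estimate, target, grid):
    """Integrated squared error on a grid, computed with the trapezoidal rule."""
    diff2 = (np.asarray(estimate, dtype=float) - np.asarray(target, dtype=float)) ** 2
    dx = np.diff(np.asarray(grid, dtype=float))
    return float(np.sum(0.5 * (diff2[:-1] + diff2[1:]) * dx))

def quartile_indices(ise_values):
    """Indices of the datasets whose ISE ranks at about 25%, 50% and 75%."""
    ise_values = np.asarray(ise_values, dtype=float)
    order = np.argsort(ise_values)
    n = len(order)
    return [int(order[int(round(p * (n - 1)))]) for p in (0.25, 0.5, 0.75)]
```

The three returned indices identify the estimates that would be plotted as $q_1$, $q_2$ and $q_3$.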

Table 1 illustrates the effect of increasing the number of replications by comparing the median and the inter-quartile range (IQR) of the calculated ISEs, for $N_j = 2$ and $N_j = 4$, obtained from 500 samples from density (i) contaminated by Laplace or normal errors when M = 200 or M = 500. These and related results indicate better performance when $N_j = 2$ than when $N_j = 4$. As suggested in the introduction, for the same total number of observations, M, it is more advantageous to have a large number of intrinsically different observations, n, than a large number of replications, $N_j$.

In Figure 1 we show the quartile curves obtained for 500 samples from density (iii) contaminated by Laplace error when n = 250, with $N_j = 2$, together with boxplots of the calculated ISEs for n = 50, 100 and 250 in the known and unknown error cases. The results show that, as in the error-free case, it is difficult to recover all the modes of this density. They also illustrate the fact that knowing the error density brings only minor improvements, which we also observed in our nonreported simulation results. In some of the nonreported cases, the results were even better for $\hat f_X$ than for $\tilde f_X$.

Figure 2 shows the quartile curves obtained from 500 samples in the case of regression function (i) for n = 100 and $N_j = 2$, when the error $U$ is normal



Table 1

Values of median × 100 (IQR × 100) of the ISE for density (i), when M = 200 or M = 500, with $N_j = 2$ or $N_j = 4$ and Laplace or normal errors

($N_j$, M)    (2, 200)       (4, 200)       (2, 500)       (4, 500)
U ~ Lap       1.41 (0.94)    1.56 (0.98)    0.89 (0.51)    0.96 (0.58)
U ~ Norm      2.09 (1.33)    2.31 (1.43)    1.42 (0.92)    1.55 (1.02)

Fig. 1. Quartile curves of 500 estimates $\hat f_X$ of density (iii) in the Laplace error case, for $N_j = 2$ and n = 250 (left panel), together with boxplots (right panel) of the 500 calculated ISEs when n = 50, 100 or n = 250. In each group of two boxplots, the first is for $\hat f_X$ and the second for $\tilde f_X$.

and $X \sim U[0,1]$. We also show boxplots of the 500 calculated ISEs in the case of Laplace and normal error $U$ and n = 100 or 250, using $\hat g$ (unknown error) or $\tilde g$ (known error). We see that the estimated curves are quite good and the results are slightly better when the error density is known.

Finally, Figure 3 shows the quartile curves in the case of regression curve (iii), when the error $U$ is Laplace, $X \sim U[0,1]$, $N_j = 2$ and n = 100 or 250. In this case, too, we see that the results are quite good and improve as sample size increases.

4.2. Real-data examples. We apply our methods to two medical exam- 
ples. The first dataset, described by Bland and Altman (1986), was collected 
to compare two methods for measuring peak expiratory flow rate (PEFR). 
Two replicated measurements of PEFR were made on 17 individuals, using 
each of two different methods: a Wright peak flow meter and a mini Wright 



Fig. 2. Quartile curves of 500 estimates $\hat g$ of the regression function (i) in the normal error case for $N_j = 2$, n = 100 and $X \sim U[0,1]$ (left panel); and boxplots of 500 ISEs for the same regression curve in the case of Laplace error (first group of four) or normal error (last group of four), for n = 100 or 250 (right panel). In each group of two boxplots, the first is for $\hat g$ and the second is for $\tilde g$.

Fig. 3. Quartile curves of 500 estimates $\hat g$ of the regression function (iii) in the Laplace error case for $N_j = 2$, $X \sim U[0,1]$ and n = 100 (left panel) or n = 250 (right panel).



meter. As described by Bland and Altman (1986), when evaluating a new 
method for measuring a clinical quantity, usually the true values remain 
unknown and a common practice is to compare the new method with the 
established method, rather than with the true quantities. The goal is thus 
to check whether the mini meter and the Wright meter are in agreement. 

To this end, we define $X_i$ as the average of all possible readings on the mini meter for individual $i$, and define $Y_i$ similarly for the "regular" Wright meter. The latter gives more stable (less variable) readings than the mini meter, and, therefore, for each individual $i$, we set $Y_i$ equal to the average of the two Wright readings. Since readings from the mini meter are more variable, we need to incorporate measurement errors there. For $j = 1, 2$, we take $W_{ij}$ to be the $j$th replicated mini Wright measurement.

The regression estimate (dashed line) is depicted in the left panel of Figure 4, together with the data $(W_{ij}, Y_i)$ and the Nadaraya-Watson estimate of $g$ (dotted line), which uses the original data and hence ignores the error $U$. The unusual shape of the dashed line, which deviates from a straight line, suggests that the two PEFR measurement methods might not be in good agreement and that further investigation should be carried out. Bland and Altman (1986) note that a standard parametric analysis of these data, not taking the noise into account, indicates agreement between the two methods. Analogously, the dotted line shows that ignoring the measurement error results in an estimate that oversmooths the data, and which lies closer to, although still far from, a straight line. For example, the steeper climb of the dashed line in the upper right-hand part of the graph, and the flatter nature of that line after the climb (compared to the dotted line), add weight to the argument that the results from the mini Wright meter, in the 600-700 range, represent stochastic variation of the relatively constant measurements obtained using the Wright peak flow meter.
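
For reference, the error-ignoring comparison curve is a standard Nadaraya-Watson estimator applied to the contaminated covariates. The sketch below is only illustrative: the Gaussian kernel, the bandwidth argument `h` and the function name `nadaraya_watson` are assumptions, since the text does not state which kernel or bandwidth was used for the dotted line.

```python
import numpy as np

def nadaraya_watson(x_eval, w, y, h):
    """Kernel regression of y on w that treats w as if it were free of error."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    u = (x_eval[:, None] - w[None, :]) / h
    weights = np.exp(-0.5 * u**2)  # Gaussian kernel (an assumption)
    # Locally weighted average of the responses at each evaluation point.
    return weights @ y / weights.sum(axis=1)
```

Because the measurement error spreads the covariate values, an estimator of this type tends to flatten features of the regression curve, which is the oversmoothing effect described above.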

The second dataset concerns two replicated measurements derived from 
CAT scans of the heads of 50 psychiatric patients. More precisely, the 
Fig. 4. Regression estimate for the PEFR data (left panel) and density estimate for the CAT data (right panel).

ventricle-brain ratio (VBR) was measured twice for each patient, using a hand-held planimeter. See Turner, Toone and Brett-Jones (1986) and Dunn (2004). The logarithm of the VBR can be described by model (2.1), and for the $i$th patient we set $W_{ij} = \log(\mathrm{VBR}_{ij})$, $j = 1, 2$, where $\mathrm{VBR}_{ij}$ denotes the $j$th contaminated replication of the measurement of VBR for patient $i$. The density estimate of the noncontaminated log VBR is plotted as the dashed line in Figure 4, and represents a smooth and symmetric density.
We also show, in the dotted line, the kernel density estimate that ignores 
measurement error. The second estimate is essentially a smoothed version 
of the density shown by the dashed line, modulo Gibbs-phenomenon wiggles 
in the tails of the latter. 

5. Outlines of technical arguments. Details of proofs, and a derivation 
of Theorem 3.3, are given by Delaigle, Hall and Meister (2006). Without loss 
of generality, c = 1 in (3.2). 

5.1. Outline proof of Theorem 3.1. Put $\psi = f_U^{ft}$, $\phi = \psi^2$ and

$$\Delta(t) = \frac{1}{N}\sum_{j=1}^{n}\sum_{(k_1,k_2)\in S_j}\bigl[\cos\{t(W_{jk_1} - W_{jk_2})\} - \phi(t)\bigr].$$
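
The average of cosines appearing here is the empirical quantity from which the error characteristic function is estimated. The sketch below assumes that the estimator in (2.3) is the square root of the absolute value of that average, and that each $S_j$ contains every unordered pair of replicates for individual $j$; both points are assumptions about definitions given earlier in the paper. The function name `error_cf_estimate` is hypothetical.

```python
import numpy as np
from itertools import combinations

def error_cf_estimate(t, replicates):
    """Estimate f_U^ft(t) from differences of replicated measurements.

    `replicates` is a sequence with one 1-D array of measurements per individual.
    The intrinsic value cancels in each within-individual difference.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    total = np.zeros_like(t)
    n_pairs = 0
    for w in replicates:
        for a, b in combinations(np.asarray(w, dtype=float), 2):
            total += np.cos(t * (a - b))
            n_pairs += 1
    phi_hat = total / n_pairs          # estimates |f_U^ft(t)|^2
    return np.sqrt(np.abs(phi_hat))    # assumed form of the estimator in (2.3)
```

The square root is well defined because a symmetric error density has a real characteristic function, and taking the absolute value guards against small negative averages in the tails.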

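In this notation,

(5.1)    $(\hat f_U^{ft} + \rho)^{-1} = \psi^{-1} I(\psi > \rho) + \sum_{\ell=1}^{k} c_\ell\,\psi^{-2\ell-1}\Delta^{\ell} + \chi_1 + \chi_2,$

where the constants $c_\ell$ are derived from binomial coefficients, $|\chi_1| \le \rho^{-1} I(\psi \le \rho)$,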

|;^2|<const.(^(V-'|A|+V'-(''=+i)|A|'=) + V-(''+')|A|'=+i 

+ pV'"'/(V' > p)+p-'i(\A\ > ^AV 






and "const.," here and below, denotes a generic positive constant depending 
only on k, fu and the parameters a and C2 of C2). 
Result (5.1) implies that 

(5.2)    $\hat f_X(x) - \tilde f_X(x) = \sum_{\ell=1}^{k} c_\ell\,\delta_{1\ell}(x) + \delta_{01}(x) + \delta_{02}(x) - \delta_2(x),$

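where, for $\ell = 1, 2$ in the case of $\delta_{0\ell}$, and $1 \le \ell \le k$ for $\delta_{1\ell}$,

$$\delta_{0\ell}(x) = \frac{1}{2\pi}\int e^{-itx}\,\hat f_W^{ft}(t)\,\chi_\ell(t)\,K^{ft}(ht)\,dt,$$

$$\delta_{1\ell}(x) = \frac{1}{2\pi}\int e^{-itx}\,\hat f_W^{ft}(t)\,\psi(t)^{-2\ell-1}\Delta(t)^{\ell}\,K^{ft}(ht)\,dt,$$

$$\delta_{2}(x) = \frac{1}{2\pi}\int e^{-itx}\,\hat f_W^{ft}(t)\,\psi(t)^{-1}\,K^{ft}(ht)\,I\{\psi(t) \le \rho\}\,dt$$

and $\hat f_W^{ft}(t) = M^{-1}\sum_j\sum_k e^{itW_{jk}}$.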

It can be proved that, for a constant $n_0 \ge 1$, the functions $\delta_{01}$ and $\delta_2$ vanish identically whenever $n \ge n_0$. Therefore, assuming $n \ge n_0$, we deduce from (5.2) that

$$\hat f_X(x) - \tilde f_X(x) = \sum_{\ell=1}^{k} c_\ell\,\delta_{1\ell}(x) + \delta_{02}(x).$$

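This formula and the fact that $\hat f_W^{ft} = \psi f_X^{ft} + \Delta_1$, where $\Delta_1 = \hat f_W^{ft} - E(\hat f_W^{ft})$, imply that

(5.3)    $\sup_{-\infty<x<\infty} E\{\hat f_X(x) - \tilde f_X(x)\}^2 \le \text{const.}\Bigl[\max_{r=2,3}\,\max_{1\le\ell\le k}\,\sup_{-\infty<x<\infty} E\{\delta_{r\ell}(x)^2\} + \sup_{-\infty<x<\infty} E\{\delta_{02}(x)^2\}\Bigr],$

where

$$\delta_{2\ell}(x) = \frac{1}{2\pi}\int e^{-itx}\,f_X^{ft}(t)\,\psi(t)^{-2\ell}\Delta(t)^{\ell}\,K^{ft}(ht)\,dt,$$

$$\delta_{3\ell}(x) = \frac{1}{2\pi}\int e^{-itx}\,\Delta_1(t)\,\psi(t)^{-2\ell-1}\Delta(t)^{\ell}\,K^{ft}(ht)\,dt.$$
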
Lengthy arguments can be used to show that 
max max sup E{6re{x)'^} 

r=2,3 l<e<k -oo<x<<x> 

< const. [n-^/i''-'"-' + + (logn)2} 

+ n-2(/l2(/3-4")-2 + /,-6a-l) + ^^k ^2f3~4ka-2 

+ n"(^+^)/i~2(^^+^)"^^] 

= 0{pn) 
and $\sup_{-\infty<x<\infty} E\{\delta_{02}(x)^2\} = O(p_n)$. Together, these bounds and (5.3) imply (3.3).

5.2. Outline proof of Theorem 3.3. For brevity we derive only the second 
part of (3.6). Since \{fl\t) + p}-^ - f^^ty^] < p/ f^^tf , then 

(5.4) \L{u)-L\u)\<^j g\tr'K^\ht)dt, 

where denotes the version of L constructed with p = 0. With probability 
7r„, say, equal to 1 - 0{n~^) for each 5 > 0, \ fij\tf < fu^{tf for all t such 
that the integrand at (5.4) does not vanish. Therefore, with probability at 

least TTn, 

sup \L{u)-L^{u)\<^^^^^ f^^'' {l + \t\f"dt<C3ph-^", 

—oo<u<oo J — i/fi 

where $s = \sup|K^{ft}|$ and $C_3 > 0$. Hence, with probability at least $\pi_n$,

sup \f^{x)-f'^ix)\<C3n~\ 

— oo<a;<oo 

which leads to the second part of (3.6). 
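
Two elementary facts underlie the bounds used in this outline, and they can be checked directly. The display below is a sketch under stated assumptions: $a$ stands for a positive value of the estimated error characteristic function, $\alpha \ge 0$ is taken to be the polynomial decay exponent of $f_U^{ft}$, and the restriction $0 < h \le 1$ is imposed for convenience.

```latex
\[
\left|\frac{1}{a+\rho}-\frac{1}{a}\right|
  = \frac{\rho}{a(a+\rho)} \le \frac{\rho}{a^{2}},
  \qquad a>0,\ \rho\ge 0,
\]
\[
\int_{-1/h}^{1/h}(1+|t|)^{2\alpha}\,dt
  = \frac{2\{(1+h^{-1})^{2\alpha+1}-1\}}{2\alpha+1}
  \le \frac{2^{2\alpha+2}}{2\alpha+1}\,h^{-2\alpha-1},
  \qquad 0<h\le 1 .
\]
```

The second inequality uses $1 + h^{-1} \le 2h^{-1}$ for $0 < h \le 1$.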